Lending Club

LendingClub is a peer-to-peer lending company headquartered in San Francisco California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market. At its height LendingClub was the world’s largest peer-to-peer lending platform. []

_images/lendingclub.png

Fig. 1 lending club website

About Personal loan

A personal loan is a loan which can be taken to meet unspecified financial needs. Today personal loan segment has diverted into many specialised loans. It can be taken for various purpose such as a wedding, traveling, paying education fee, medical emergencies or any undefined reason etc. The interest paid on a personal loan is in most cases higher than that payable on secured loans. []

Business Understanding

Modified some of contents from [].

The LendingClub company lends various types of loans to their customers. The company will make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

  • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company

  • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

The data given contains the information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for takin actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.

When a person applies for a loan, there are two types of decisions that could be taken by the company:

  1. Loan accepted: If the company approves the loan, there are 3 possible scenarios described below:

    • Fully paid: Applicant has fully paid the loan (the principal and the interest rate)

    • Current: Applicant is in the process of paying the instalments, i.e. the tenure of the loan is not yet completed. These candidates are not labelled as ‘defaulted’.

    • Charged-off: Applicant has not paid the instalments in due time for a long period of time, i.e. he/she has defaulted on the loan

  2. Loan rejected: The company had rejected the loan (because the candidate does not meet their requirements etc.). Since the loan was rejected, there is no transactional history of those applicants with the company and so this data is not available with the company (and thus in this dataset)

Business Metric

TBD

  • It might be some explanation on why some of customer fails to pay the loan for reject purpose

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import hvplot.pandas
from pathlib import Path

main_path = Path().absolute().parent
data_path = main_path.parent / 'data' / 'p2p' / 'lending_club' / 'processed'

Exploratory Data Analysis

Data Description

LoanStatNew

Description

loan_amnt

The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

term

The number of payments on the loan. Values are in months and can be either 36 or 60.

loan_status

Current status of the loan.

int_rate

Interest Rate on the loan.

installment

The monthly payment owed by the borrower if the loan originates.

grade

LC assigned loan grade.

sub_grade

LC assigned loan subgrade.

emp_title

The job title supplied by the Borrower when applying for the loan.

emp_length

Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

home_ownership

The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER

annual_inc

The self-reported annual income provided by the borrower during registration.

verification_status

Indicates if income was verified by LC, not verified, or if the income source was verified.

issue_d

The month which the loan was funded.

purpose

A category provided by the borrower for the loan request.

title

The loan title provided by the borrower.

dti

A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrowe’s self-reported monthly income.

earliest_cr_line

The month the borrower’s earliest reported credit line was opened.

open_acc

The number of open credit lines in the borrower’s credit file.

pub_rec

Number of derogatory public records.

pub_rec_bankruptcies

Number of public record bankruptcies.

revol_bal

Total credit revolving balance.

revol_util

“Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

total_acc

The total number of credit lines currently in the borrower’s credit file.

initial_list_status

The initial listing status of the loan. Possible values are – W, F

application_type

Indicates whether the loan is an individual application or a joint application with two co-borrowers.

mort_acc

Number of mortgage accounts.

addr_state

The state provided by the borrower in the loan application.

# import data
df = pd.read_csv( data_path / 'accepted.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1345310 entries, 0 to 1345309
Data columns (total 27 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   loan_amnt             1345310 non-null  float64
 1   term                  1345310 non-null  object 
 2   loan_status           1345310 non-null  object 
 3   int_rate              1345310 non-null  float64
 4   installment           1345310 non-null  float64
 5   grade                 1345310 non-null  object 
 6   sub_grade             1345310 non-null  object 
 7   emp_title             1259525 non-null  object 
 8   emp_length            1266799 non-null  object 
 9   home_ownership        1345310 non-null  object 
 10  annual_inc            1345310 non-null  float64
 11  verification_status   1345310 non-null  object 
 12  issue_d               1345310 non-null  object 
 13  purpose               1345310 non-null  object 
 14  title                 1328651 non-null  object 
 15  dti                   1344936 non-null  float64
 16  earliest_cr_line      1345310 non-null  object 
 17  open_acc              1345310 non-null  float64
 18  pub_rec               1345310 non-null  float64
 19  revol_bal             1345310 non-null  float64
 20  revol_util            1344453 non-null  float64
 21  total_acc             1345310 non-null  float64
 22  initial_list_status   1345310 non-null  object 
 23  application_type      1345310 non-null  object 
 24  mort_acc              1298029 non-null  float64
 25  pub_rec_bankruptcies  1344613 non-null  float64
 26  addr_state            1345310 non-null  object 
dtypes: float64(12), object(15)
memory usage: 277.1+ MB

loan status

Current status of the loan.

df['loan_status'].value_counts().hvplot.bar(
    title='Loan Status Counts', xlabel='Loan Status', ylabel='Count', 
    width=500, height=350, yformatter='%d'
)

loan_amnt & installment

  • loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

  • installment: The monthly payment owed by the borrower if the loan originates.

loan_amnt = df.hvplot.hist(
    y='loan_amnt', by='loan_status', subplots=False, 
    width=400, height=400, bins=50, alpha=0.4, title='Loan Amount', 
    xlabel='Loan Amount', ylabel='Counts', legend='top',
    yformatter='%d'
)
loan_amnt
installment = df.hvplot.hist(
    y='installment', by='loan_status', subplots=False, 
    width=400, height=400, bins=50, alpha=0.4, title='Installment', 
    xlabel='Installment', ylabel='Counts', legend='top',
    yformatter='%d'
)
installment
installment_box = df.hvplot.box(
    y='installment', subplots=True, by='loan_status', width=250, height=400, 
    title='Installment', xlabel='Loan Status', ylabel='Installment', legend=False
)

loan_amnt_box = df.hvplot.box(
    y='loan_amnt', subplots=True, by='loan_status', width=250, height=400, 
    title='Loan Amount', xlabel='Loan Status', ylabel='Loan Amount', legend=False
)

loan_amnt_box + installment_box

term & int_rate

  • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

  • int_rate: Interest Rate on the loan.

term = df.groupby(['loan_status'])[['term']].value_counts().rename('Count').hvplot.bar()
term.opts(
    title="Term", xlabel='Term / Loan Status', ylabel='Count',
    width=500, height=450, show_legend=True, yformatter='%d'
)
int_rate = df.hvplot.hist(
    y='int_rate', by='loan_status', subplots=False, 
    width=400, height=400, bins=30, alpha=0.4, 
    title='Interest Rate', 
    xlabel='Interest Rate', ylabel='Counts', legend='top',
    yformatter='%d'
)
int_rate

usually sort-term has lower interest rate

df.loc[:, ['loan_status', 'term', 'int_rate']].hvplot.hist(
    y='int_rate', groupby='loan_status', by='term', subplots=False, 
    width=400, height=400, bins=30, alpha=0.4, 
    title='Interest Rate by term', xlabel='Interest Rate', ylabel='Counts', legend='top',
    yformatter='%d', dynamic=False
)

Interesting relation between interest rate and installment it that can be calculate by the following formula if using “Equal repayment of principal and interest”

def cal_amount_erpi(loan_amnt, int_rate, term):
    """
    loan_anmt: loan amount
    int_rate: interest rate, percentage
    term: in month
    """
    int_rate_monthly = int_rate / 100 / 12
    payment_monthly = loan_amnt * int_rate_monthly
    total_to_pay = payment_monthly * (1 + int_rate_monthly)**term
    return total_to_pay / ((1 + int_rate_monthly)**term - 1)
df_temp = df.loc[:, ['installment', 'loan_amnt', 'int_rate', 'term']].copy()
df_temp['term'] = df_temp['term'].str.strip('months').str.strip().astype(np.int32)
df_temp['installment_cal'] = df_temp.apply(lambda x: cal_amount_erpi(x['loan_amnt'], x['int_rate'], x['term']), axis=1)

not all the payment are following “Equal repayment of principal and interest”

df_temp_diff = (df_temp['installment_cal'] - df_temp['installment'])
df_diff = df_temp_diff.agg(['mean', 'std'])
print(f'Difference of Mean: {df_diff["mean"]:.4f} Standard Deviation {df_diff["std"]:.4f}')
df_temp_diff.loc[abs(df_temp_diff) > 1].hvplot.hist(
    subplots=False, width=400, height=400, bins=50, alpha=0.4, 
    title='Differences between calculated installment (Diff > 1)', xlabel='Diff', ylabel='Counts', legend='top',
    yformatter='%d',
)
Difference of Mean: 0.2274 Standard Deviation 7.1209

grade & sub_grade

  • grade: LC assigned loan grade.

  • sub_grade: LC assigned loan subgrade.

grade = df.groupby(['loan_status'])[['grade']].value_counts().sort_index().hvplot.bar(
    width=400, height=400, title='Grade Distribution', xlabel='Grade', ylabel='Count', 
    legend='top', yformatter='%d'
)
grade
sub_grade = df.groupby(['loan_status'])[['sub_grade']].value_counts().sort_index().hvplot.barh(
    width=400, height=800, title='Sub-Grade Distribution', xlabel='Sub-Grade', ylabel='Count', 
    legend='top', xformatter='%d'
)
sub_grade

usually people don’t charge off at grade A - B, usually C grade are more often charge off. but cannot say that the people who has lower grade pay less on their loan.

df.loc[df['grade'].isin(['E', 'F', 'G'])].groupby(['loan_status', 'grade'])[['grade']].value_counts().rename('count')\
    .hvplot.table(title='Grade Count in E-G')

home_ownership & purpose

  • home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER

  • purpose: A category provided by the borrower for the loan request.

df.groupby(['loan_status'])[['home_ownership']].value_counts().rename('count').hvplot.table()
home_ownership = df.groupby(['loan_status'])[['home_ownership']].value_counts().rename('Count').hvplot.bar()
home_ownership.opts(
    title='Home Ownership', xlabel='Home Ownership / Loan Status', ylabel='Count',
    width=700, height=450, show_legend=True, yformatter='%d'
)
purpose = df.groupby(['loan_status'])[['purpose']].value_counts().rename('Count').hvplot.bar()
purpose.opts(
    title="Purpose", xlabel='Purpose / Loan Status', ylabel='Count',
    width=700, height=450, show_legend=True, yformatter='%d', xrotation=90
)

annual_inc & verification_status

  • annual_inc: The self-reported annual income provided by the borrower during registration.

  • verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified

df.groupby(['loan_status', 'verification_status'])['annual_inc'].describe().round(2).hvplot.table(
    title='Annual Income Table Description By Loan Status & Verification', height=200, width=700)
(df.groupby(['loan_status'])[['verification_status']].value_counts() / df.groupby(['loan_status'])['verification_status'].count())\
    .rename('percentage').hvplot.table(title='Income Verified Rate', height=200)
def is_outlier(x): 
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    upper = np.percentile(x, 75) + (iqr * 1.5)
    lower = np.percentile(x, 25) - (iqr * 1.5)

    return (x > upper) | (x < lower)

annual_inc = df.loc[~df.groupby(['loan_status', 'verification_status'])['annual_inc'].apply(is_outlier), 
    ['loan_status', 'verification_status', 'annual_inc']].hvplot.hist(
    y='annual_inc', by='loan_status', groupby='verification_status', subplots=False, 
    width=700, height=400, bins=40, alpha=0.4, title='Annual Income(1Q~3Q +/- 1.5*IQR) Distsribution', 
    xlabel='Annual Income', ylabel='Counts', legend='top', yformatter='%d', xformatter='%d', dynamic=False
)
annual_inc
  • Q: What is the difference between verified and not verified who has much high/lower income?

df.loc[df.groupby(['loan_status', 'verification_status'])['annual_inc'].apply(is_outlier), 
['loan_status', 'verification_status', 'annual_inc']].groupby(['loan_status', 'verification_status'])['annual_inc'].describe()
count mean std min 25% 50% 75% max
loan_status verification_status
Charged Off Not Verified 2127.0 172166.797414 127377.440279 137004.0 148000.0 155000.0 180000.00 5604824.0
Source Verified 6584.0 224642.582775 284071.924266 146300.0 158000.0 180000.0 225000.00 9500000.0
Verified 4563.0 208832.507318 155431.846754 149512.0 160000.0 180000.0 220000.00 7500000.0
Fully Paid Not Verified 12973.0 183947.109903 113846.621947 149337.0 153000.0 167000.0 185000.00 7000000.0
Source Verified 23828.0 265642.868322 314363.519615 168000.0 186000.0 212000.0 264634.25 10999200.0
Verified 15122.0 259742.658187 150249.433294 171175.0 190000.0 219000.0 275000.00 6100000.0

emp_title & emp_length

  • emp_title: The job title supplied by the Borrower when applying for the loan.

  • emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

check_null = lambda x: x.isnull().sum()
df_emp_null = df.loc[:, ['emp_title', 'emp_length']].apply([check_null, pd.Series.nunique]).rename(index={'<lambda>': 'nnull'}).reset_index().hvplot.table(
    title='Job Title & Employment length: NA values', height=100
)
df_emp_top20 = df['emp_title'].value_counts().reset_index().rename(columns={'index': 'emp_title', 'emp_title': 'count'})[:20].hvplot.table(
    title='Job Title Top 20'
)
df_emp_null
df_emp_top20
df['emp_length'].fillna('unknown', inplace=True)
df['emp_title'].fillna('unknown', inplace=True)
df['emp_title'] = df['emp_title'].str.lower()  # Unify into lower cases
df_emp_top20 = df['emp_title'].value_counts().reset_index().rename(columns={'index': 'emp_title', 'emp_title': 'count'})[:20].hvplot.table(
    title='Job Title Top 20'
)
df_emp_top20
df_emp_bottom20 = df['emp_title'].value_counts().reset_index().rename(columns={'index': 'emp_title', 'emp_title': 'count'})[-20:].hvplot.table(
    title='Job Title Bottom 20'
)
df_emp_bottom20
print(df['emp_title'].nunique())
326450

titles are not normalized(or structured), too many unique titles in the data.

from itertools import product

loan_status_order = ['Charged Off', 'Fully Paid']
emp_length_order = ['unknown', '< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years']
emp_length = df.groupby(['loan_status'])[['emp_length']].value_counts().reindex(list(product(*[loan_status_order, emp_length_order])))\
    .rename('Count').hvplot.barh(stacked=True, legend='right')
emp_length.opts(
    title='Employment Length in years', height=400, width=700, xlabel='Counts', ylabel='Employment Length in years', xformatter='%d'
)

issue_d & earliest_cr_line

  • issue_d: The month which the loan was funded.

  • earliest_cr_line: The month the borrower’s earliest reported credit line was opened.

Red is the people who charge-off and the blue is the people who fully paied. Most people try to do the loan near the 2016 and started to create their credit line at 2000.

import calendar

month_dict = {m: n for n, m in enumerate([calendar.month_abbr[i] for i in range(1, 13)], 1)}

df['issue_d'] = pd.to_datetime(df['issue_d'].str.split('-').apply(lambda x: f'{x[1]}-{month_dict.get(x[0])}'))
df['earliest_cr_line'] = pd.to_datetime(df['earliest_cr_line'].str.split('-').apply(lambda x: f'{x[1]}-{month_dict.get(x[0])}'))
fully_paid = df.loc[df['loan_status']=='Fully Paid', 'issue_d'].hvplot.hist(bins=35) 
charged_off = df.loc[df['loan_status']=='Charged Off', 'issue_d'].hvplot.hist(bins=35)

# fully_paid * charged_off
loan_issue_date = (fully_paid * charged_off).opts(
    title='Loan Issue Date Distribution', xlabel='Loan Issue Date', ylabel='Count',
    width=350, height=350, legend_cols=2, legend_position='top_right'
).opts(xrotation=45, yformatter='%d')

fully_paid = df.loc[df['loan_status']=='Fully Paid', 'earliest_cr_line'].hvplot.hist(bins=35) 
charged_off = df.loc[df['loan_status']=='Charged Off', 'earliest_cr_line'].hvplot.hist(bins=35)

earliest_cr_line = (fully_paid * charged_off).opts(
    title='Earliest reported credit line', xlabel='earliest_cr_line', ylabel='Count',
    width=350, height=350, legend_cols=2, legend_position='top_right'
).opts(xrotation=45, yformatter='%d')

loan_issue_date + earliest_cr_line
  • Q: Are there anyone who applied before the credit line is reported?

issue_report = df['issue_d'] < df['earliest_cr_line']
print(f'The percentage that who applied before the credit line is reported: {(issue_report).sum() / len(df)}')
The percentage that who applied before the credit line is reported: 0.0
  • Q: Are there any difference between months?

df['issue_d_month'] = df['issue_d'].dt.month

issue_d_month = df.groupby(['loan_status'])[['issue_d_month']].value_counts().rename('Count').hvplot.bar()
issue_d_month.opts(
    title="Issue Date Distribution in every month by Loan Status", xlabel='Month', ylabel='Count',
    width=700, height=450, show_legend=True, yformatter='%d'
)

title

title is duplicated with the purpose column, will drop it later

print(df['title'].isnull().sum())
16659
df['title'] = df['title'].str.lower()
df['title'].value_counts()[:10]
debt consolidation         682107
credit card refinancing    250295
home improvement            77604
other                       66862
major purchase              24334
medical expenses            13681
business                    11843
car financing               11245
vacation                     8233
moving and relocation        7952
Name: title, dtype: int64

dti, open_acc, total_acc

What Is Debt-to-Income Ratio?

Your debt-to-income ratio compares your debt payments to your monthly gross income, or how much you earn each month before taxes and other deductions. Your DTI ratio gives lenders a clearer picture of your current debt and income, and is used to determine how much money you can afford to responsibly borrow.

Monthly debt may include:

  • Minimum credit card payments

  • Loan payments (such as car payments, student loan payments, personal loans, and other loan payments)

  • Monthly alimony or child support payments

  • Rent payment or mortgage payments

  • Other debts included in your credit report

  • dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

  • open_acc: The number of open credit lines in the borrower’s credit file.

  • total_acc: The total number of credit lines currently in the borrower’s credit file.

NerdWallet website says that the maximum allowed DTI ratio is 40% for single applicants and 35% for joint applicants. In the Lending Club website, seems like over 40% DTI is not a good signal, they suggest some way to improve the DTI ratio.

df['dti'].describe().reset_index().hvplot.table(title='DTI Table Description', height=250)
# Can DTI be 999?

It seems like the DTI over 60 can be treated as outlier data, may need to drop them.

df.loc[df['dti'] > 40].groupby(['loan_status'])['dti'].describe().hvplot.table(title='DTI > 40% Table Description', height=100)
dti_sub = df.loc[df['dti'] <= 40].hvplot.hist(
    y='dti', by='loan_status', bins=50, width=400, height=350, subplots=False, 
    title="dti(<=50) Distribution", xlabel='dti', ylabel='Count', shared_axes=False,
    alpha=0.4, legend='top', yformatter='%d'
)

dti_sub2 = df.loc[df['dti'] > 40].hvplot.hist(
    y='dti', by='loan_status', bins=100, width=400, height=350, subplots=False, 
    title="dti(>50) Distribution", xlabel='dti', ylabel='Count', shared_axes=False,
    alpha=0.4, legend='top', yformatter='%d'
)

dti_sub + dti_sub2
open_acc = df.hvplot.hist(
    y='open_acc', by='loan_status', bins=50, width=450, height=350, 
    title='The number of open credit lines', xlabel='The number of open credit lines', ylabel='Count', 
    alpha=0.4, legend='top', yformatter='%d'
)

total_acc = df.hvplot.hist(
    y='total_acc', by='loan_status', bins=50, width=450, height=350, 
    title='The total number of credit lines', xlabel='The total number of credit lines', ylabel='Count', 
    alpha=0.4, legend='top', yformatter='%d'
)

open_acc + total_acc

revol_bal & revol_util

What is revolving balance?

In credit card terms, a revolving balance is the portion of credit card spending that goes unpaid at the end of a billing cycle.

https://www.capitalone.com/learn-grow/money-management/revolving-credit-balance/

  • revol_bal: Total credit revolving balance.

  • revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

df.groupby(['loan_status'])['revol_bal'].describe().round(2).reset_index().hvplot.table(title='Revolving Balance Table Description', height=100)
revol_bal = df.hvplot.hist(
    y='revol_bal', by='loan_status', bins=50, width=350, height=400, 
    title='Revolving Balance', xlabel='Revolving balance', ylabel='Count', 
    alpha=0.4, legend='top', yformatter='%d', xformatter='%d'
).opts(xrotation=45)

revol_bal_sub = df.loc[df['revol_bal']<=250000].hvplot.hist(
    y='revol_bal', by='loan_status', bins=50, width=350, height=400, 
    title='Revolving Balance(<=250000)', xlabel='Revolving balance', ylabel='Count', 
    alpha=0.4, legend='top', yformatter='%d', xformatter='%d', shared_axes=False
)
revol_bal + revol_bal_sub
revol_util = df.hvplot.hist(
    y='revol_util', by='loan_status', bins=50, width=350, height=400, 
    title='Revolving line utilization rate', xlabel='Revolving line utilization rate', ylabel='Count', 
    alpha=0.4, legend='top'
).opts(yformatter='%d')

revol_util_sub = df[df['revol_util'] < 120].hvplot.hist(
    y='revol_util', by='loan_status', bins=50, width=350, height=400, 
    title='Revolving line utilization rate (< 120)', xlabel='Revolving line utilization rate', ylabel='Count', 
    shared_axes=False, alpha=0.4, legend='top'
).opts(yformatter='%d')

revol_util + revol_util_sub

pub_rec, pub_rec_bankruptcies & mort_acc

What is derogatory record?

A derogatory public record is negative information on your credit report that is of a more serious nature and has become a matter of public record. It usually consists of bankruptcy filings, civil court judgments, foreclosures and tax liens. In some states, child support delinquencies are also a matter of public record.

https://budgeting.thenest.com/derogatory-public-record-mean-25266.html

What is a mortgage?

The mortgage refers to a loan used to purchase or maintain a home, land, or other types of real estate. Usually paying the mortgage consistently will increse the credit score.

https://www.investopedia.com/terms/m/mortgage.asp https://www.investopedia.com/articles/personal-finance/031215/how-mortgages-affect-credit-scores.asp

  • pub_rec: Number of derogatory public records.

  • pub_rec_bankruptcies: Number of public record bankruptcies.

  • mort_acc: Number of mortgage accounts.

From the data we can process these data as binary who had never have a public record versus more than once.

pub_rec = df.groupby(['loan_status'])['pub_rec'].value_counts().rename('Count').hvplot.barh(
    title='The number of derogatory', xlabel='The number of derogatory', ylabel='Count',
    width=400, height=800, xformatter='%d'
)
pub_rec
pub_rec_bankruptcies = df.groupby(['loan_status'])['pub_rec_bankruptcies'].value_counts().rename('Count').hvplot.barh(
    title='The number of public record bankruptcies', xlabel='The number of public record bankruptcies', ylabel='Count',
    width=400, height=600, xformatter='%d'
)
pub_rec_bankruptcies
df['mort_acc'].describe().round(2)
count    1298029.00
mean           1.67
std            2.00
min            0.00
25%            0.00
50%            1.00
75%            3.00
max           51.00
Name: mort_acc, dtype: float64
mort_acc = df.groupby(['loan_status'])['mort_acc'].value_counts().rename('Count').hvplot.barh(
    title='The number of mortgage accounts', xlabel='The number of mortgage accounts', ylabel='Count',
    width=400, height=700, xformatter='%d'
)

print(df['mort_acc'].isnull().sum())

mort_acc
47281

initial_list_status, application_type & addr_state

  • initial_list_status: The initial listing status of the loan. Possible values are – W, F

  • application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers.

  • addr_state: The state provided by the borrower in the loan application.

initial_list_status = df.groupby(['loan_status'])['initial_list_status'].value_counts().rename('Count').hvplot.bar(
    title='The initial listing status of the loan', xlabel='The initial listing status of the loan', ylabel='Count',
    width=400, height=400, yformatter='%d'
)
initial_list_status
application_type = df.groupby(['loan_status'])['application_type'].value_counts().rename('Count').hvplot.bar(
    title='The application type', xlabel='The application type', ylabel='Count',
    width=400, height=400, yformatter='%d'
)
application_type
addr_state = df.groupby(['loan_status'])['addr_state'].value_counts().rename('Count').hvplot.barh(
    title='The state provided by the borrower', xlabel='The state', ylabel='Count',
    width=500, height=850, xformatter='%d', legend='right'
)
addr_state

Data Preprocessing

  • Drop columns

  • Missing values

  • Detecting outlieres

# reload the data
df = pd.read_csv( data_path / 'accepted.csv')
print(f'Data shape: {df.shape}')
print(df.columns)
Data shape: (1345310, 27)
Index(['loan_amnt', 'term', 'loan_status', 'int_rate', 'installment', 'grade',
       'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'purpose', 'title', 'dti',
       'earliest_cr_line', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util',
       'total_acc', 'initial_list_status', 'application_type', 'mort_acc',
       'pub_rec_bankruptcies', 'addr_state'],
      dtype='object')

According to our EDA, We will not use following columns:

  • title: duplicated with purpose

  • emp_title: too many unique jobs, but seems like some of them are duplicated

  • issue_d, earliest_cr_line: nothing interesting

Other columns

  • term: change its’ type to integer

  • grade, sub_grade, home_ownership, purpose, initial_list_status, application_type, addr_state: do label encoding

  • emp_length: add ‘unknown’ for NaN values and do the label encoding

  • verification_status: convert source verified as verified together and do the label encoding

  • dti: drop which dti over 60%

  • revol_bal: drop which has over $ 250,000 balance

  • revol_util: drop which has over 120% utilization rate

  • pub_rec, pub_rec_bankruptcies: convert as binary, who has ever had the record or not

  • mort_acc: can convert to categories [0, 1, 2, 3, 4, 5, 5+] and do the label encoding

  • loan_status: target column, do the label encoding

from collections import defaultdict

# drop columns values
df.drop(columns=['title', 'emp_title', 'issue_d', 'earliest_cr_line'], inplace=True)
print(f"- Dropped columns: {['title', 'emp_title', 'issue_d', 'earliest_cr_line']}")

# term
df['term'] = df['term'].str.rstrip('months').astype(int)
print('- Type changed into integer: term')

# need a label encoding
encode_dict = defaultdict(dict)
for c in ['grade', 'sub_grade', 'home_ownership', 'purpose', 'initial_list_status', 'application_type', 'addr_state']:
    encode_dict[c]['v2i'] = {v: i for i, v in enumerate(sorted(df[c].unique()))}
    encode_dict[c]['i2v'] = {i: v for v, i in encode_dict[c]['v2i'].items()}
    df[c] = df[c].map(encode_dict[c]['v2i'])
    print(f'- Label encoded: {c}')

# emp_length
c = 'emp_length'
df[c].fillna('unknown', inplace=True)
emp_values = ['< 1 year', '1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years']
encode_dict[c]['v2i'] = {v: i for i, v in enumerate(emp_values)}
encode_dict[c]['v2i']['unknown'] = 99
encode_dict[c]['i2v'] = {i: v for v, i in encode_dict[c]['v2i'].items()}
df[c] = df[c].map(encode_dict[c]['v2i'])
print(f'- Label encoded: {c}')

# verification_status
c = 'verification_status'
df[c].replace(to_replace='Source Verified', value='Verified', inplace=True)
print(f'- Merged "Source Verified" into "Verified": {c}')
veri_values = ['Not Verified', 'Verified']
encode_dict[c]['v2i'] = {v: i for i, v in enumerate(veri_values)}
encode_dict[c]['i2v'] = {i: v for v, i in encode_dict[c]['v2i'].items()}
df[c] = df[c].map(encode_dict[c]['v2i'])
print(f'- Label encoded: {c}')

# dti, revol_bal, revol_util
for c, thres in zip(['dti', 'revol_bal', 'revol_util'], [60, 250000, 120]):
    drop_idx = df.index[df[c] > thres]
    df.drop(index=drop_idx, inplace=True)
    print(f'- Dropped # of {(len(drop_idx)/len(df))*100:.2f}% ({len(drop_idx)}) data: {c}')

# pub_rec, pub_rec_bankruptcies
for c in ['pub_rec', 'pub_rec_bankruptcies']:
    df[c].apply(lambda x: 0 if x == 0 else 1)
    print(f'- Convert into binary feature data: {c}')

# mort_acc
c = 'mort_acc'
df.loc[df[c] > 5, c] = 6
mort_values = ['0', '1', '2', '3', '4', '5', '5+']
encode_dict[c]['v2i'] = {v: i for i, v in enumerate(mort_values)}
encode_dict[c]['i2v'] = {i: v for v, i in encode_dict[c]['v2i'].items()}
df.loc[df[c] > 5, c] = 6
print(f'- Label encoded: {c}')

# target column encoding
c = 'loan_status'
encode_dict[c]['v2i'] = {'Fully Paid': 0, 'Charged Off': 1}
encode_dict[c]['i2v'] = {i: v for v, i in encode_dict[c]['v2i'].items()}
df[c] = df[c].map(encode_dict[c]['v2i'])
print(f'- Label encoded: {c}')

df.reset_index(drop=True, inplace=True)
- Dropped columns: ['title', 'emp_title', 'issue_d', 'earliest_cr_line']
- Type changed into integer: term
- Label encoded: grade
- Label encoded: sub_grade
- Label encoded: home_ownership
- Label encoded: purpose
- Label encoded: initial_list_status
- Label encoded: application_type
- Label encoded: addr_state
- Label encoded: emp_length
- Merged "Source Verified" into "Verified": verification_status
- Label encoded: verification_status
- Dropped # of 0.13% (1714) data: dti
- Dropped # of 0.12% (1591) data: revol_bal
- Dropped # of 0.01% (116) data: revol_util
- Convert into binary feature data: pub_rec
- Convert into binary feature data: pub_rec_bankruptcies
- Label encoded: mort_acc
- Label encoded: loan_status

Check missing data

for column in df.columns:
    missing_col = df[column].isnull().sum()
    if missing_col != 0:
        missing_percentage = (missing_col / len(df)) * 100
        print(f"'{column}': number of missing values {missing_col}({missing_percentage:.3f}%)")
'dti': number of missing values 373(0.028%)
'revol_util': number of missing values 853(0.064%)
'mort_acc': number of missing values 47281(3.523%)
'pub_rec_bankruptcies': number of missing values 697(0.052%)

Since the data is not that much we will drop these records.

for c in ['dti', 'revol_util', 'mort_acc', 'pub_rec_bankruptcies']:
    drop_idx = df.index[df[c].isnull()]
    df.drop(index=drop_idx, inplace=True)
df.reset_index(drop=True, inplace=True)
df_loan_status_counts = df['loan_status'].value_counts()
print(f'The total number of records is {len(df)}')
print(f"- loan_status = Fully Paid: {df_loan_status_counts.iloc[0]}")
print(f"- loan_status = Charged Off: {df_loan_status_counts.iloc[1]}")
The total number of records is 1293445
- loan_status = Fully Paid: 1032730
- loan_status = Charged Off: 260715
# Save the processed data
# import pickle
# with (data_path / 'encode_dict.pickle').open('wb') as file:
#     pickle.dump(encode_dict, file)
# df.to_csv(data_path / 'accepted_processed.csv', index=False, encoding='utf-8')

Modeling

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the processed data
df = pd.read_csv(data_path / 'accepted_processed.csv')
y = df['loan_status']
X = df.loc[:, ~df.columns.isin(['loan_status'])]
spiliter = StratifiedShuffleSplit(n_splits=1, test_size=0.1)
train_idx, test_idx = list(*spiliter.split(X, y))
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.loc[test_idx], y.loc[test_idx]
model = RandomForestClassifier(n_jobs=8, verbose=1)
model.fit(X_train, y_train)
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:   20.2s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:   52.2s finished
RandomForestClassifier(n_jobs=8, verbose=1)
y_pred = model.predict(X_test)
rpt = classification_report(y_true=y_test, y_pred=y_pred)
print(rpt)
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.3s
              precision    recall  f1-score   support

           0       0.81      0.98      0.89    103273
           1       0.53      0.08      0.14     26072

    accuracy                           0.80    129345
   macro avg       0.67      0.53      0.52    129345
weighted avg       0.75      0.80      0.74    129345
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.9s finished

Explanation on Models

import shap
shap.initjs()
np.random.seed(78)
# pick 20 right predictions for each class
y_correct = y_test[y_test == y_pred]
num_samples = 10
rnd_idx_0 = np.random.choice(y_correct[y_correct == 0].index, num_samples) 
rnd_idx_1 = np.random.choice(y_correct[y_correct == 1].index, num_samples)
rnd_idx = np.concatenate((rnd_idx_0, rnd_idx_1))
y_sampled = y_correct.loc[rnd_idx]
X_sampled = X_test.loc[rnd_idx]
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_sampled, y_sampled)
shap.summary_plot(shap_values, X_sampled, alpha=0.8, color_bar=True)
_images/LendingClub_94_0.png
label = 1
shap.dependence_plot('sub_grade', shap_values[label], features=X_sampled)
_images/LendingClub_95_0.png
shap_values[1].shape
(20, 22)
label = 0
sample_index = 0
shap.force_plot(explainer.expected_value[label], shap_values[label][sample_index], features=X_sampled.iloc[sample_index])
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
label = 1
sample_index = 10
shap.force_plot(explainer.expected_value[label], shap_values[label][sample_index], features=X_sampled.iloc[sample_index])
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.